motivating case study
trees and partitions
growing and pruning
tuning and stability
The study r Citep(bib, "GERBER2008") investigated how social pressure influences voter turnout. The authors defined four types of mailers and sent them at random to potential voters ahead of the 2006 Michigan general election.
On average, those who received the neighbor intervention were more likely to vote.
But are some groups more strongly influenced than others? The statistical term for this is “treatment effect heterogeneity.”
Intuitively, there are some people who are already so jaded (or motivated) that the intervention is unlikely to change their behavior.
Formulation.
Goal. Find subsets of \(\mathcal{X}\) where \(\tau\left(x\right)\) is large: \[\begin{align*} \tau\left(x\right) := \mu\left(x, 1\right) - \mu\left(x, 0\right) \end{align*}\]
Modeling. We need a statistical model that partitions \(\mathcal{X}\) into interpretable subgroups where \(\tau\left(x\right)\) is large.
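As a concrete illustration of the goal, here is a plug-in estimate of \(\tau\) within pre-specified subgroups: the difference in mean outcomes between treated and control units in each group. This is a hedged sketch; the function name and arguments are hypothetical, not from the study.

```python
from collections import defaultdict

def group_cates(y, w, g):
    """Plug-in per-group treatment effect estimate (a sketch):
    tau_hat(group) = mean(y | w=1, group) - mean(y | w=0, group).
    y: outcomes, w: treatment indicators (0/1), g: subgroup labels."""
    # accumulate [treated sum, treated count, control sum, control count] per group
    sums = defaultdict(lambda: [0.0, 0, 0.0, 0])
    for yi, wi, gi in zip(y, w, g):
        s = sums[gi]
        if wi:
            s[0] += yi; s[1] += 1
        else:
            s[2] += yi; s[3] += 1
    # difference in means, for groups with both treated and control units
    return {gi: s[0] / s[1] - s[2] / s[3]
            for gi, s in sums.items() if s[1] and s[3]}
```

In a randomized experiment this difference in means is an unbiased estimate of \(\tau\) within each subgroup; trees will later supply the subgroups themselves.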
A tree is a partition \(\mathcal{X} = \cup_{j=1}^J R_j\) where each region \(R_j\) carries constant prediction \(\hat{y}_j\).
Geometry: Axis-aligned recursive binary splitting. Each split partitions a region into \(\{x: X_j < s\}\) and \(\{x: X_j \geq s\}\).
Prediction in \(R_j\):
Regression: Residual sum of squares \[\text{RSS}(T) = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_j)^2\]
Classification: Impurity measures (entropy-based)
These measure class mixing. Small values indicate node purity, i.e., observations predominantly from a single class.
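The two standard impurity measures can be sketched directly from their definitions, applied to the vector of class proportions \(\hat{p}_{mk}\) in a node:

```python
import math

def gini(props):
    """Gini index: sum_k p_k (1 - p_k) = 1 - sum_k p_k^2."""
    return 1.0 - sum(p * p for p in props)

def entropy(props):
    """Cross-entropy (deviance): -sum_k p_k log p_k, with 0 log 0 := 0."""
    return -sum(p * math.log(p) for p in props if p > 0)
```

Both are zero for a pure node (one class has proportion 1) and maximal when classes are evenly mixed.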
Greedy algorithm: At each step, select predictor \(X_j\) and cutpoint \(s\) to minimize RSS (regression) or impurity (classification).
Define half-planes: \[R_1(j,s) = \{X|X_j < s\}, \quad R_2(j,s) = \{X|X_j \geq s\}\]
Choose \(j\) and \(s\) minimizing: \[\sum_{i: x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2\]
Repeat: split one of the existing regions until stopping criterion (minimum node size) reached.
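The single greedy step above, exhaustive search over predictors \(j\) and cutpoints \(s\), can be sketched for the regression case as follows (variable names are illustrative):

```python
def rss(ys):
    """Residual sum of squares around the mean of ys."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(X, y):
    """Search over all predictors j and observed cutpoints s for the
    split minimizing RSS(R1) + RSS(R2). X: list of feature vectors.
    Returns (j, s, total_rss)."""
    best = (None, None, float("inf"))
    p = len(X[0])
    for j in range(p):
        for s in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            if not left or not right:  # skip degenerate splits
                continue
            total = rss(left) + rss(right)
            if total < best[2]:
                best = (j, s, total)
    return best
```

Growing a full tree repeats this search within each resulting region until the stopping criterion is met.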
Split gain: \[\Delta I = I_{\text{parent}} - \left(\frac{n_L}{n} I_L + \frac{n_R}{n} I_R\right)\]
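The gain formula, instantiated with the Gini index as the impurity \(I\) (a self-contained sketch operating on label lists rather than proportions):

```python
def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gain(parent, left, right):
    """Impurity decrease Delta I = I_parent - (n_L/n * I_L + n_R/n * I_R)."""
    n = len(parent)
    return gini(parent) - (len(left) / n * gini(left)
                           + len(right) / n * gini(right))
```

A split that separates the classes perfectly recovers the full parent impurity as gain.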
Numerical: Split at \(X_j < s\). Search over observed values.
Categorical: For \(K\) levels, partition into two subsets. There are \(2^{K-1}-1\) possible binary splits.
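The \(2^{K-1}-1\) count can be checked by enumeration: fix one level on the left side to avoid counting mirrored partitions twice, then choose any subset of the remaining levels (a small illustrative sketch):

```python
from itertools import combinations

def binary_splits(levels):
    """All distinct binary partitions of a categorical feature's levels."""
    levels = sorted(levels)
    anchor = levels[0]          # fixing one level avoids mirrored duplicates
    rest = levels[1:]
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            right = set(levels) - left
            if right:           # exclude the empty right side
                splits.append((left, right))
    return splits
```

For \(K = 4\) levels this yields \(2^{3} - 1 = 7\) splits, matching the formula; the exponential growth is why many implementations restrict or approximate this search for high-cardinality categoricals.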
Missing values: Use surrogate splits—alternate predictors producing similar partitions.
Large trees overfit. For a given \(\alpha \geq 0\), find a subtree \(T \subset T_0\) minimizing: \[\sum_{m=1}^{|T|} \sum_{i: x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|\] where \(|T|\) counts terminal nodes.
\(\alpha\) penalizes complexity. When \(\alpha = 0\), optimal subtree is \(T_0\). As \(\alpha\) increases, optimal subtree shrinks.
Weakest link pruning. As \(\alpha\) increases from zero, branches are removed in a nested fashion, so the entire sequence of optimal subtrees can be computed efficiently.
Select \(\alpha\) by \(K\)-fold cross-validation. This is the standard lasso-style tradeoff between fit and complexity.
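The weakest-link idea can be made concrete: under the criterion \(\text{RSS}(T) + \alpha|T|\), collapsing a branch to a single leaf becomes no worse once \(\alpha\) reaches a critical threshold. A minimal sketch with illustrative names:

```python
def weakest_link_alpha(rss_branch, leaves_branch, rss_collapsed):
    """Critical alpha at which collapsing a branch with `leaves_branch`
    terminal nodes and total RSS `rss_branch` into a single leaf
    (with RSS `rss_collapsed`) ties under RSS + alpha * |T|.
    Derived from: rss_branch + alpha * leaves = rss_collapsed + alpha * 1."""
    return (rss_collapsed - rss_branch) / (leaves_branch - 1)
```

Pruning repeatedly removes the branch with the smallest such threshold, which is what makes the whole \(\alpha\)-path cheap to compute.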
Axis-alignment: Splits are perpendicular to the coordinate axes. Capturing a diagonal boundary requires many nodes – a smooth function must be approximated by a “staircase.”
Instability: Small data perturbations can change the root split, hence the entire tree structure. High variance estimator.
Correlated predictors: When \(X_i \approx X_j\), the choice between them is essentially arbitrary, and variable importance measures become unreliable.
Trees are piecewise constant on axis-aligned rectangles; linear models fit a single global hyperplane.
Tree model: \[f(X) = \sum_{m=1}^M c_m \cdot \mathbb{1}_{(X \in R_m)}\]
Linear model: \[f(X) = \beta_0 + \sum_{j=1}^p X_j \beta_j\]
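Evaluating the tree model \(f(X) = \sum_m c_m \mathbb{1}(X \in R_m)\) amounts to finding which region \(x\) falls into and returning that region's constant. A sketch in which the regions and constants are hypothetical:

```python
def tree_predict(x, region_tests, c):
    """Evaluate f(x) = sum_m c_m * 1(x in R_m), where region_tests is a
    list of membership predicates (one per region R_m) that partition X."""
    for test, cm in zip(region_tests, c):
        if test(x):
            return cm
    raise ValueError("x falls in no region; the R_m should partition X")

# Hypothetical one-split tree: R_1 = {x: x_0 < 3}, R_2 = {x: x_0 >= 3}
regions = [lambda x: x[0] < 3, lambda x: x[0] >= 3]
constants = [0.0, 10.0]
```

The contrast with the linear model is visible in the code: prediction is a lookup into a partition, not a weighted sum of coordinates.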
The original motivation was trees that doctors had created informally r Citep(bib, "Breiman2017").
Respond to [Split Choice Visual Explanation] in the exercise sheet.